Initialise by loading data.
load("druguse.RData") #Load data
Using ggplot() and geom_bar(), create a plot to compare drug use level within each country and rename the labels accordingly.
histcountryuselevel <- ggplot(druguse, aes(x = country, fill = UseLevel)) + geom_bar() #Create bar plot
histcountryuselevel + xlab("Country") + ylab("Use Level Count") + labs(fill='Use Level') #Assign x, y and legend labels
Replacing the x aesthetic with the gender column of the druguse data, create another plot comparing drug use by gender.
histcountrygender <- ggplot(druguse, aes(x = gender, fill = UseLevel)) + geom_bar() #Create bar plot
histcountrygender + xlab("Gender") + ylab("Use Level Count") + labs(fill='Use Level') #Assign x, y and legend labels
Using summary() we can show summary statistics for the data in druguse, and str() shows the structure of the data frame.
summary(druguse) #Summary of the data frame
## agegroup gender education country
## 18-24:643 female:942 Min. :-2.435910 Australia : 54
## 25-34:481 male :943 1st Qu.:-0.611130 Canada : 87
## 35-44:356 Median :-0.059210 NewZealand : 5
## 45-54:294 Mean :-0.003806 Other : 118
## 55-64: 93 3rd Qu.: 0.454680 RepublicofIreland: 20
## 65+ : 18 Max. : 1.984370 UK :1044
## USA : 557
## ethnicity neuroticism extraversion
## Asian : 26 Min. :-3.464360 Min. :-3.273930
## Black : 33 1st Qu.:-0.678250 1st Qu.:-0.695090
## Mixed-Black/Asian: 3 Median : 0.042570 Median : 0.003320
## Mixed-White/Asian: 20 Mean : 0.000047 Mean :-0.000163
## Mixed-White/Black: 20 3rd Qu.: 0.629670 3rd Qu.: 0.637790
## Other : 63 Max. : 3.273930 Max. : 3.273930
## White :1720
## opentoexperience agreeableness conscientiousness
## Min. :-3.273930 Min. :-3.464360 Min. :-3.464360
## 1st Qu.:-0.717270 1st Qu.:-0.606330 1st Qu.:-0.652530
## Median :-0.019280 Median :-0.017290 Median :-0.006650
## Mean :-0.000534 Mean :-0.000245 Mean :-0.000386
## 3rd Qu.: 0.723300 3rd Qu.: 0.760960 3rd Qu.: 0.584890
## Max. : 2.901610 Max. : 3.464360 Max. : 3.464360
##
## impulsiveness sensation caffeine chocolate
## Min. :-2.555240 Min. :-2.078480 Min. :0.000 Min. :0.000
## 1st Qu.:-0.711260 1st Qu.:-0.525930 1st Qu.:5.000 1st Qu.:5.000
## Median :-0.217120 Median : 0.079870 Median :6.000 Median :5.000
## Mean : 0.007216 Mean :-0.003292 Mean :5.484 Mean :5.107
## 3rd Qu.: 0.529750 3rd Qu.: 0.765400 3rd Qu.:6.000 3rd Qu.:6.000
## Max. : 2.901610 Max. : 1.921730 Max. :6.000 Max. :6.000
##
## nicotine alcohol amphetamine amylnitrite
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:4.000 1st Qu.:0.000 1st Qu.:0.0000
## Median :3.000 Median :5.000 Median :0.000 Median :0.0000
## Mean :3.201 Mean :4.635 Mean :1.341 Mean :0.6069
## 3rd Qu.:6.000 3rd Qu.:6.000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.0000
##
## benzodiaz cannabis cocaine crack
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.000 Median :3.000 Median :0.000 Median :0.0000
## Mean :1.465 Mean :2.989 Mean :1.161 Mean :0.2976
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.0000
##
## ecstasy heroin ketamine legalhighs
## Min. :0.000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.0000 Median :0.000
## Mean :1.314 Mean :0.374 Mean :0.5692 Mean :1.356
## 3rd Qu.:3.000 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:3.000
## Max. :6.000 Max. :6.000 Max. :6.0000 Max. :6.000
##
## LSD methadone mushrooms volatiles
## Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.000 Median :0.0000 Median :0.000 Median :0.0000
## Mean :1.062 Mean :0.8265 Mean :1.187 Mean :0.4334
## 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :6.000 Max. :6.0000 Max. :6.000 Max. :6.0000
##
## any severity UseLevel
## FALSE: 195 Min. : 0.00 low : 849
## TRUE :1690 1st Qu.: 5.00 high:1036
## Median :16.00
## Mean :18.19
## 3rd Qu.:29.00
## Max. :64.00
##
str(druguse) #Show structure of the data frame
## 'data.frame': 1885 obs. of 33 variables:
## $ agegroup : Factor w/ 6 levels "18-24","25-34",..: 1 1 2 2 3 1 1 2 3 1 ...
## $ gender : Factor w/ 2 levels "female","male": 1 2 1 2 1 1 2 1 1 2 ...
## $ education : num 0.455 -0.611 1.164 1.164 0.455 ...
## $ country : Factor w/ 7 levels "Australia","Canada",..: 6 7 6 6 6 7 7 6 6 6 ...
## $ ethnicity : Factor w/ 7 levels "Asian","Black",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ neuroticism : num -0.1488 -0.921 1.0212 -1.5508 -0.0519 ...
## $ extraversion : num -0.948 1.741 0.962 1.114 0.962 ...
## $ opentoexperience : num -1.4242 0.5833 -0.9763 -0.0193 -1.119 ...
## $ agreeableness : num -0.4532 -1.3429 -0.6063 -0.0173 0.9416 ...
## $ conscientiousness: num 0.26 1.134 -0.143 1.462 1.631 ...
## $ impulsiveness : num 0.53 0.53 -1.38 0.193 -1.38 ...
## $ sensation : num -0.8464 1.2247 -1.1808 0.0799 -2.0785 ...
## $ caffeine : num 4 6 6 6 5 6 5 6 6 5 ...
## $ chocolate : num 5 5 6 5 6 6 4 6 6 6 ...
## $ nicotine : num 2 6 5 4 0 6 4 3 0 4 ...
## $ alcohol : num 4 4 5 5 3 5 3 3 5 5 ...
## $ amphetamine : num 0 3 0 0 0 0 2 0 0 4 ...
## $ amylnitrite : num 0 0 2 0 0 0 3 0 0 2 ...
## $ benzodiaz : num 0 2 0 0 0 0 3 0 0 3 ...
## $ cannabis : num 2 5 3 2 0 6 5 0 1 4 ...
## $ cocaine : num 0 3 2 0 0 0 2 0 0 2 ...
## $ crack : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ecstasy : num 0 0 0 0 0 0 3 0 0 4 ...
## $ heroin : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ketamine : num 0 0 0 0 0 0 0 0 0 4 ...
## $ legalhighs : num 0 3 2 4 0 5 3 0 0 3 ...
## $ LSD : num 0 0 0 0 0 0 3 0 0 2 ...
## $ methadone : num 0 0 0 0 0 2 4 0 0 0 ...
## $ mushrooms : num 0 3 0 0 0 3 3 0 0 2 ...
## $ volatiles : num 0 0 0 0 0 0 0 0 0 0 ...
## $ any : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 1 2 2 2 2 2 ...
## $ severity : num 4 25 14 10 0 22 35 3 1 34 ...
## $ UseLevel : Factor w/ 2 levels "low","high": 1 2 2 1 1 2 2 1 1 2 ...
Using pairs() we can draw every pairwise scatter plot of the variables to look for relationships between them visually.
pairs(druguse, gap=0)#apply pairs to the data frame
We can quantify the relationships suggested by pairs() by calculating the correlation matrix. We first convert every column of the data frame to numeric using lapply() and as.numeric() so that we can compute the correlation matrix with cor(). We then reshape the matrix using melt() from the reshape2 package so that we can plot a heatmap of the correlogram in ggplot.
druguse_num <- data.frame(lapply(druguse,as.numeric)) #convert all columns to a numeric
correlation <- cor(druguse_num) #calculate correlations
melted_cor <- melt(correlation) #use melt to reshape correlation matrix into a dataframe for ggplot
#create a heatmap of the correlation matrix in ggplot
corplot <- ggplot(data=melted_cor, aes(x=Var1, y=Var2, fill=value)) + geom_tile() + xlab("") + ylab("") + ggtitle("Correlation Heatmap") #create plot using geom_tile and assign labels and title
corplot + geom_text(aes(Var2, Var1, label = round(value,2)),
color = "black", size = 2) + theme(axis.text.x = element_text(angle = 90),
plot.title = element_text(hjust = 0.5)) + scale_fill_gradientn(colours = rainbow(20)) #modify the text angle and colour and add correlation values onto each tile
This allows us to easily visualise the relationships in the data so that we can decide what to use as predictors. We can order the correlation values and remove duplicates and self-correlations, then use head() and tail() to show the most positively and most negatively correlated pairs.
melted_cor_ord1 <- melted_cor[order(melted_cor$value ,decreasing = T),] #order correlations
melted_cor_ord2 <- melted_cor_ord1[seq(34,nrow(melted_cor_ord1),2),] #remove duplicates and correlation with self
head(melted_cor_ord2, n=10) #show most positively correlated
## Var1 Var2 value
## 1056 UseLevel severity 0.8279854
## 758 severity ecstasy 0.7517185
## 660 UseLevel cannabis 0.7464855
## 659 severity cannabis 0.7458712
## 692 severity cocaine 0.7287190
## 560 severity amphetamine 0.7244818
## 857 severity legalhighs 0.7068349
## 956 severity mushrooms 0.7036681
## 890 severity LSD 0.6713733
## 887 mushrooms LSD 0.6686274
tail(melted_cor_ord2, n=10) #show most negatively correlated
## Var1 Var2 value
## 29 mushrooms agegroup -0.3267421
## 12 sensation agegroup -0.3275861
## 308 impulsiveness conscientiousness -0.3351333
## 33 UseLevel agegroup -0.3768257
## 23 ecstasy agegroup -0.3822633
## 175 conscientiousness neuroticism -0.3910884
## 32 severity agegroup -0.4072982
## 26 legalhighs agegroup -0.4100335
## 172 extraversion neuroticism -0.4310511
## 20 cannabis agegroup -0.4370538
Using rcorr() from Hmisc we can calculate the p-value for each correlation pair to determine whether the correlation is statistically significant at a given significance level. As with the correlogram, we can then create a heatmap of the p-values using ggplot.
#use rcorr from Hmisc to calculate p values for the correlation of all elements and extract the p values
cor_pvalues <- rcorr(as.matrix(druguse_num))$P
#melt to reshape for ggplot
melted_cor_pvalues <- melt(cor_pvalues)
melted_cor_pvalues[is.na(melted_cor_pvalues)] <- 0 #replace NAs with 0
#create a heatmap of the correlation pvalues matrix in ggplot
corplot <- ggplot(data=melted_cor_pvalues, aes(x=Var1, y=Var2, fill=value)) + geom_tile() + xlab("") + ylab("") +ggtitle("Correlation Pvalues Heatmap")
corplot + geom_text(aes(Var2, Var1, label = round(value,2)),
color = "black", size = 2) + theme(axis.text.x = element_text(angle = 90),
plot.title = element_text(hjust = 0.5)) + scale_fill_gradientn(colours = c("blue","red"))
From this we can clearly see the high p-values, and thus which pairs are not significantly related. We can extract the correlation pairs involving UseLevel to see its relationship with the predictors in the data set.
cor_UseLevel <- melted_cor_ord2[(melted_cor_ord2$Var1 == "UseLevel"),] #keep only correlations with UseLevel
cor_UseLevel #print
## Var1 Var2 value
## 1056 UseLevel severity 0.82798538
## 660 UseLevel cannabis 0.74648553
## 759 UseLevel ecstasy 0.65682304
## 858 UseLevel legalhighs 0.61336785
## 957 UseLevel mushrooms 0.61169870
## 693 UseLevel cocaine 0.59065875
## 561 UseLevel amphetamine 0.58671220
## 891 UseLevel LSD 0.56739033
## 495 UseLevel nicotine 0.53958655
## 627 UseLevel benzodiaz 0.49578385
## 396 UseLevel sensation 0.47205494
## 825 UseLevel ketamine 0.40937284
## 924 UseLevel methadone 0.40528902
## 1023 UseLevel any 0.37523210
## 264 UseLevel opentoexperience 0.37108294
## 594 UseLevel amylnitrite 0.34200580
## 363 UseLevel impulsiveness 0.33004475
## 990 UseLevel volatiles 0.32468935
## 726 UseLevel crack 0.31812696
## 66 UseLevel gender 0.31716181
## 792 UseLevel heroin 0.31697286
## 198 UseLevel neuroticism 0.13967826
## 165 UseLevel ethnicity 0.08650985
## 429 UseLevel caffeine 0.06388204
## 132 UseLevel country 0.00788457
## 528 UseLevel alcohol -0.01672327
## 231 UseLevel extraversion -0.03882249
## 462 UseLevel chocolate -0.08955852
## 297 UseLevel agreeableness -0.18077992
## 99 UseLevel education -0.25774412
## 330 UseLevel conscientiousness -0.31499336
## 33 UseLevel agegroup -0.37682565
We can also do this for the p values to determine which correlation pairs containing UseLevel are statistically significant.
#order p values highest to lowest
melted_cor_pvalues_ord <- melted_cor_pvalues[order(melted_cor_pvalues$value ,decreasing = T),]
head(melted_cor_pvalues_ord, 10)#show highest p values
## Var1 Var2 value
## 141 agreeableness ethnicity 0.9937756
## 269 ethnicity agreeableness 0.9937756
## 313 alcohol conscientiousness 0.9929622
## 505 conscientiousness alcohol 0.9929622
## 311 chocolate conscientiousness 0.9876450
## 439 conscientiousness chocolate 0.9876450
## 162 volatiles ethnicity 0.9818996
## 962 ethnicity volatiles 0.9818996
## 245 chocolate opentoexperience 0.9571145
## 437 opentoexperience chocolate 0.9571145
tail(melted_cor_pvalues_ord, 10)#show lowest p values
## Var1 Var2 value
## 1080 heroin UseLevel 0
## 1081 ketamine UseLevel 0
## 1082 legalhighs UseLevel 0
## 1083 LSD UseLevel 0
## 1084 methadone UseLevel 0
## 1085 mushrooms UseLevel 0
## 1086 volatiles UseLevel 0
## 1087 any UseLevel 0
## 1088 severity UseLevel 0
## 1089 UseLevel UseLevel 0
We can then choose a significance level to test at, in this case 5%, to show which predictors are unlikely to be correlated with UseLevel.
#find p values relating to use level
pvalues_cor_uselevel <- melted_cor_pvalues_ord[melted_cor_pvalues_ord$Var1 == "UseLevel",] #extract just those relating to UseLevel
notsignificant <- pvalues_cor_uselevel[pvalues_cor_uselevel$value >= 0.05,] #not significant at 5% significance level
pvalues_cor_uselevel #print
## Var1 Var2 value
## 132 UseLevel country 7.322761e-01
## 528 UseLevel alcohol 4.680617e-01
## 231 UseLevel extraversion 9.197753e-02
## 429 UseLevel caffeine 5.528074e-03
## 165 UseLevel ethnicity 1.695587e-04
## 462 UseLevel chocolate 9.877888e-05
## 198 UseLevel neuroticism 1.127642e-09
## 297 UseLevel agreeableness 2.664535e-15
## 33 UseLevel agegroup 0.000000e+00
## 66 UseLevel gender 0.000000e+00
## 99 UseLevel education 0.000000e+00
## 264 UseLevel opentoexperience 0.000000e+00
## 330 UseLevel conscientiousness 0.000000e+00
## 363 UseLevel impulsiveness 0.000000e+00
## 396 UseLevel sensation 0.000000e+00
## 495 UseLevel nicotine 0.000000e+00
## 561 UseLevel amphetamine 0.000000e+00
## 594 UseLevel amylnitrite 0.000000e+00
## 627 UseLevel benzodiaz 0.000000e+00
## 660 UseLevel cannabis 0.000000e+00
## 693 UseLevel cocaine 0.000000e+00
## 726 UseLevel crack 0.000000e+00
## 759 UseLevel ecstasy 0.000000e+00
## 792 UseLevel heroin 0.000000e+00
## 825 UseLevel ketamine 0.000000e+00
## 858 UseLevel legalhighs 0.000000e+00
## 891 UseLevel LSD 0.000000e+00
## 924 UseLevel methadone 0.000000e+00
## 957 UseLevel mushrooms 0.000000e+00
## 990 UseLevel volatiles 0.000000e+00
## 1023 UseLevel any 0.000000e+00
## 1056 UseLevel severity 0.000000e+00
## 1089 UseLevel UseLevel 0.000000e+00
notsignificant #print
## Var1 Var2 value
## 132 UseLevel country 0.73227613
## 528 UseLevel alcohol 0.46806166
## 231 UseLevel extraversion 0.09197753
Using plotly we can create an interactive 3D scatter plot with a symbol for each UseLevel and 3 significant predictors of use level on the axes. The plot appears to show two groups, supporting what we inferred previously from the correlations and p-values.
#create a new data frame that is all numeric except UseLevel, as otherwise we can't add a legend
druguse_num_p <- druguse_num
druguse_num_p$UseLevel <- druguse$UseLevel
#create 3d plot of 3 significant predictors
sp3 <- plot_ly(druguse_num_p, x = ~nicotine, y = ~sensation, z= ~opentoexperience, mode = 'markers',
color = I('blue'), symbol = ~UseLevel, symbols = c('circle','o')) %>%#create plot and set x,y,z
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Nicotine'),
yaxis = list(title = 'Sensation'),
zaxis = list(title = 'Open to experience')), title="3D plot of 3 significant predictors of UseLevel") #set axis labels
sp3
Similarly to before, we can extract the p-values for severity to show which predictors are statistically significant for predicting severity.
pvalues_cor_severity <- melted_cor_pvalues_ord[melted_cor_pvalues_ord$Var1 == "severity",]#find p values relating to severity
pvalues_cor_severity #print
## Var1 Var2 value
## 131 severity country 3.624009e-01
## 230 severity extraversion 1.618099e-01
## 527 severity alcohol 1.128850e-01
## 428 severity caffeine 1.795644e-03
## 164 severity ethnicity 1.184769e-04
## 461 severity chocolate 1.181627e-04
## 197 severity neuroticism 8.881784e-16
## 32 severity agegroup 0.000000e+00
## 65 severity gender 0.000000e+00
## 98 severity education 0.000000e+00
## 263 severity opentoexperience 0.000000e+00
## 296 severity agreeableness 0.000000e+00
## 329 severity conscientiousness 0.000000e+00
## 362 severity impulsiveness 0.000000e+00
## 395 severity sensation 0.000000e+00
## 494 severity nicotine 0.000000e+00
## 560 severity amphetamine 0.000000e+00
## 593 severity amylnitrite 0.000000e+00
## 626 severity benzodiaz 0.000000e+00
## 659 severity cannabis 0.000000e+00
## 692 severity cocaine 0.000000e+00
## 725 severity crack 0.000000e+00
## 758 severity ecstasy 0.000000e+00
## 791 severity heroin 0.000000e+00
## 824 severity ketamine 0.000000e+00
## 857 severity legalhighs 0.000000e+00
## 890 severity LSD 0.000000e+00
## 923 severity methadone 0.000000e+00
## 956 severity mushrooms 0.000000e+00
## 989 severity volatiles 0.000000e+00
## 1022 severity any 0.000000e+00
## 1055 severity severity 0.000000e+00
## 1088 severity UseLevel 0.000000e+00
We can create a 2D interactive histogram contour plot of severity and age group to show how they are distributed relative to one another.
hc1 <- subplot(#use subplot so we can plot the histograms along with the contours
plot_ly(data=druguse,x =~agegroup, type = 'histogram', name = 'Age group'),
plotly_empty(),
plot_ly(druguse,x =~agegroup, y =~severity, type = 'histogram2dcontour', showscale = T, name = 'Contour'),
plot_ly(druguse,y =~severity, type = 'histogram', name = 'Severity'),
nrows = 2, heights = c(0.2, 0.8), widths = c(0.8, 0.2),
shareX = TRUE, shareY = TRUE, titleX = FALSE, titleY = FALSE
)
hc2 <- layout(hc1, showlegend = F, xaxis = list(title = 'Age group'), yaxis = list(title = 'Severity'), title = 'Histogram contour of severity and age group') #add title and axis labels
hc2 #show plot
We can also plot histograms of the first 16 predictors to examine their distributions.
par(mfrow=c(4,4))#set up a 4x4 grid in the plotting device
for (n in 1:16){#loop for first 16 predictors
t <- sprintf('Histogram of %s', colnames(druguse_num)[n])#create names for title
hist(druguse_num[,n], main = t, xlab = sprintf('%s', colnames(druguse_num)[n]))#create histogram
}
Create validation and training datasets.
druguse_predictors <- druguse[c(1:16,33)] #Data set of just the predictors
druguse_train <- druguse_predictors[1:1400,] #Training dataset
druguse_val <- druguse_predictors[1401:1885,] #Validation dataset
Create a logistic regression model using glm(), specifying link='logit' for a logistic regression, and output a summary of the model.
model <- glm(UseLevel ~ ., family=binomial(link='logit'), data=druguse_train)#Create a logistic regression classifier using all predictors to predict UseLevel
summary(model) #Summarise the model
##
## Call:
## glm(formula = UseLevel ~ ., family = binomial(link = "logit"),
## data = druguse_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3880 -0.3865 0.1221 0.4357 2.8387
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.84266 1.13278 -0.744 0.45694
## agegroup25-34 0.26901 0.24571 1.095 0.27359
## agegroup35-44 -0.17433 0.25605 -0.681 0.49598
## agegroup45-54 -0.65564 0.27445 -2.389 0.01690 *
## agegroup55-64 -0.98729 0.40893 -2.414 0.01576 *
## agegroup65+ -16.81790 502.60332 -0.033 0.97331
## gendermale 0.93652 0.18530 5.054 4.33e-07 ***
## education -0.25677 0.09947 -2.581 0.00984 **
## countryCanada -0.81196 0.62733 -1.294 0.19556
## countryNewZealand -0.43152 1.48226 -0.291 0.77096
## countryOther -1.00621 0.61469 -1.637 0.10164
## countryRepublicofIreland -1.61533 0.95965 -1.683 0.09233 .
## countryUK -2.10637 0.51748 -4.070 4.69e-05 ***
## countryUSA 0.07998 0.54650 0.146 0.88364
## ethnicityBlack -0.46926 1.21613 -0.386 0.69960
## ethnicityMixed-Black/Asian 13.17401 2399.54491 0.005 0.99562
## ethnicityMixed-White/Asian 1.34439 1.34799 0.997 0.31860
## ethnicityMixed-White/Black 0.61315 1.13218 0.542 0.58812
## ethnicityOther 0.41618 0.97087 0.429 0.66816
## ethnicityWhite 1.06211 0.85269 1.246 0.21291
## neuroticism -0.06843 0.10697 -0.640 0.52234
## extraversion -0.13871 0.11164 -1.242 0.21406
## opentoexperience 0.56631 0.10678 5.304 1.14e-07 ***
## agreeableness -0.09095 0.09740 -0.934 0.35041
## conscientiousness -0.31084 0.10487 -2.964 0.00304 **
## impulsiveness -0.01198 0.11249 -0.107 0.91517
## sensation 0.72043 0.13213 5.453 4.96e-08 ***
## caffeine 0.07957 0.08073 0.986 0.32431
## chocolate -0.09428 0.08338 -1.131 0.25815
## nicotine 0.48841 0.03924 12.447 < 2e-16 ***
## alcohol -0.06487 0.06890 -0.942 0.34642
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1923.39 on 1399 degrees of freedom
## Residual deviance: 894.64 on 1369 degrees of freedom
## AIC: 956.64
##
## Number of Fisher Scoring iterations: 15
Looking at the model summary, we can use the coefficient of each predictor to determine whether a smoker is more or less likely to have a high use level. The coefficient of nicotine is 0.48841; since it is positive, smokers are more likely to have a high use level, and its low p-value suggests this effect is statistically significant. Chocolate has a coefficient of -0.09428, which is negative, suggesting chocolate consumers are less likely to have a high use level; however, its p-value is high, so this may not be statistically significant and there may in fact be no relationship.
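Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. A minimal sketch using the two coefficient values quoted above (on a fitted model the same numbers come from exp(coef(model))):

```r
#Logistic regression coefficients are log-odds; exp() converts them to odds ratios
coef_nicotine <- 0.48841   #nicotine coefficient from the model summary above
coef_chocolate <- -0.09428 #chocolate coefficient from the model summary above
exp(coef_nicotine)   #about 1.63: each unit increase in nicotine use multiplies the odds of high use by ~1.63
exp(coef_chocolate)  #about 0.91: each unit of chocolate consumption multiplies the odds by ~0.91
```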
Create predictions using the predict function; if a prediction is greater than 0.5, assign it the value 1 for high, and 0 otherwise, indicating low. Then count the correct predictions and errors of each type and create a table displaying them.
pp=predict(model, newdata=druguse_val)#Create predictions (on the link scale, so the 0.5 cutoff below corresponds to a predicted probability of about 0.62)
mypreds=ifelse(pp>0.5, 1,0) # predict High if pp > 0.5 and predict Low otherwise.
#calculate errors of each type
TP=sum(mypreds==1 & druguse_val$UseLevel=="high")
FP=sum(mypreds==1 & druguse_val$UseLevel=="low")
TN=sum(mypreds==0 & druguse_val$UseLevel=="low")
FN=sum(mypreds==0 & druguse_val$UseLevel=="high")
#make table
results <- matrix(c(TP,FP,FN,TN), ncol=2, byrow=TRUE)
#assign names
rownames(results) <- c("Predict High","Predict Low")
colnames(results) <- c("Actually High", "Actually Low")
results <- as.table(results)
results #print
## Actually High Actually Low
## Predict High 207 26
## Predict Low 51 201
Determine the accuracy.
Accuracy <- (TP+TN)/(TP+TN+FN+FP); Accuracy #calculate accuracy and print
## [1] 0.8412371
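From the same confusion-matrix counts we can also compute precision and recall alongside accuracy; a quick sketch using the counts printed above:

```r
#Counts from the confusion table above
TP <- 207; FP <- 26; FN <- 51; TN <- 201
accuracy  <- (TP + TN) / (TP + TN + FP + FN) #proportion of all predictions that are correct
precision <- TP / (TP + FP)                  #of those predicted high, the fraction actually high
recall    <- TP / (TP + FN)                  #of those actually high, the fraction we predicted high
round(c(accuracy = accuracy, precision = precision, recall = recall), 4)
#accuracy 0.8412, precision 0.8884, recall 0.8023
```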
Create a prediction object with prediction() and use performance() to compute the ROC curve.
pr <- prediction(pp, as.numeric(druguse_val$UseLevel)) # create a ROCR prediction object
prf <- performance(pr, measure = "tpr", x.measure = "fpr") # compute TPR vs FPR for the ROC curve
plot(prf, main="ROC of logistic regression model") # plot it
We can also calculate the area under the ROC curve (the AUC) as a quantitative measure of the model's performance.
#calculate area under the curve
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.926903
The ROC curve suggests this is a good model: the closer the curve is to the top-left corner the better, as that corresponds to a high true positive rate with a low false positive rate, which is what we want. The AUC is close to one, which also suggests the model predicts well, since an AUC of 1 corresponds to a perfect model.
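The AUC also has a probabilistic interpretation: it is the probability that a randomly chosen high-use case receives a higher score than a randomly chosen low-use case. A toy illustration with made-up scores (not our model's output):

```r
#Toy classifier scores: a higher score means more confidence the case is positive
scores_pos <- c(0.9, 0.8, 0.4) #scores of actual positives
scores_neg <- c(0.5, 0.3, 0.2) #scores of actual negatives
#AUC = fraction of (positive, negative) pairs where the positive scores higher
auc_toy <- mean(outer(scores_pos, scores_neg, ">"))
auc_toy #8 of the 9 pairs are ordered correctly, so AUC = 8/9
```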
We can split the data into 10 equal-sized folds, training on 9 of them and validating on the remaining one, until we've used each of the 10 folds as the validation data. Averaging the accuracy over the 10 iterations estimates the accuracy of the model on a new data set.
folds <- cut(seq(1,nrow(druguse_predictors)),breaks=10,labels=FALSE)#cut data into 10 blocks
Accuracy <- c()#create place to store accuracies in memory
for(i in 1:10){
testIndexes <- which(folds==i,arr.ind=TRUE)#create indices for test and training data
testingData <- druguse_predictors[testIndexes, ]#create test data
trainingData <- druguse_predictors[-testIndexes, ]#create training data
model <- glm(UseLevel ~ ., family=binomial(link='logit'), data=trainingData)#Create a logistic regression classifier using all predictors to predict UseLevel
pp=predict(model, newdata=testingData)#Create predictions
mypreds=ifelse(pp>0.5, 1,0) # predict High if pp > 0.5 and predict Low otherwise.
#calculate errors and correct predictions
TP=sum(mypreds==1 & testingData$UseLevel=="high")
FP=sum(mypreds==1 & testingData$UseLevel=="low")
TN=sum(mypreds==0 & testingData$UseLevel=="low")
FN=sum(mypreds==0 & testingData$UseLevel=="high")
Accuracy[i] <- (TP+TN)/(TP+TN+FN+FP)#store accuracy
}
summary(Accuracy)#use summary to see the average accuracy and other metrics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7619 0.8320 0.8409 0.8372 0.8537 0.8677
This gives an idea of how the model will perform in general, with the average and metrics such as the max, min and quartiles showing how much the performance could vary on new data sets.
First we create a numerical version of the data set, removing the predictors with high p-values that we identified in our data analysis. We also scale the data, as this can improve the accuracy of our classifiers and the speed at which methods like neural networks converge.
#remove predictors with high p value
druguse_num <- druguse_num[,c(1:3,5,6,8:15)]
#scale data using scale function and recreate dataframe
druguse_num_s <- data.frame(lapply(druguse_num, scale))
#add drug UseLevel column to dataframe
druguse_num_s$UseLevel <- druguse$UseLevel
We will now use bootstrapping to estimate the performance of each classifier on a new dataset. We randomly select a sample with replacement for training and then use the rest of the data set (the out-of-bag observations) for validation.
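As an aside on why this leaves enough data to validate on: a bootstrap sample of size n drawn with replacement contains on average about 63.2% (1 - 1/e) of the distinct rows, so roughly a third of the data is out-of-bag each time. A quick hypothetical check:

```r
set.seed(1) #seed for this illustration only
n <- 1e5
bootsample <- sample(n, n, replace = TRUE) #bootstrap sample of row indices
in_bag <- length(unique(bootsample)) / n   #fraction of rows appearing at least once
in_bag     #close to 1 - exp(-1), i.e. ~0.632
1 - in_bag #out-of-bag fraction, ~0.368
```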
set.seed(1729)#set seed to the Hardy-Ramanujan number so the bootstrapping results are reproducible
The naive Bayes classifier assumes the predictors are conditionally independent; our data analysis suggests this may not be true, which could affect the performance of this classifier. Using Bayes' theorem we calculate the relative probabilities of a new observation vector being of class high or low given the values in the training set, and classify it into whichever class has the higher probability.
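The Bayes'-theorem step can be illustrated by hand with a single binary feature and hypothetical numbers (not estimated from druguse):

```r
#Hypothetical priors and likelihoods, chosen only for illustration
p_high <- 0.55; p_low <- 0.45 #class priors P(high), P(low)
p_smoker_given_high <- 0.8    #P(smoker | high)
p_smoker_given_low  <- 0.3    #P(smoker | low)
#Bayes' theorem: P(high | smoker) = P(smoker | high) P(high) / P(smoker)
num_high <- p_smoker_given_high * p_high
num_low  <- p_smoker_given_low  * p_low
p_high_given_smoker <- num_high / (num_high + num_low)
p_high_given_smoker #about 0.765, so this smoker would be classified as high
```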
Accuracybayes <- matrix(NA,nrow=10,ncol=1) #create place to store data
colnames(Accuracybayes) <- "bayes"#name column
for(i in 1:10){
bootsample <- sample(nrow(druguse_num_s),nrow(druguse_num_s), replace = T)#take sample indices
outofbag <- setdiff(1:nrow(druguse_num_s), bootsample)#create validation data indices
modelbayes <- naiveBayes(UseLevel ~ ., data=druguse_num_s[bootsample,])#Create a naive bayes classifier for uselevel
pbayes <- predict(modelbayes, newdata=druguse_num_s[outofbag,])#Create predictions
Accuracybayes[i,1] <- sum(pbayes==druguse_num_s$UseLevel[outofbag])/length(outofbag)#calculate accuracy
}
summary(Accuracybayes)#summarise accuracy
## bayes
## Min. :0.7850
## 1st Qu.:0.8069
## Median :0.8197
## Mean :0.8213
## 3rd Qu.:0.8333
## Max. :0.8580
Naive Bayes provides a high mean accuracy; however, other techniques that make fewer assumptions may offer higher accuracies.
Linear discriminant analysis assumes each class follows a multivariate Gaussian distribution with a common covariance matrix. It uses the training data to estimate the parameters of each class's Gaussian by maximum likelihood. Because the covariance matrices are assumed equal, the decision boundaries are linear: they are perpendicular to the lines joining the class centroids. A new data point is classified into whichever class has the highest probability.
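With equal priors and (for simplicity) an identity covariance matrix, the LDA rule reduces to assigning a point to the nearer class centroid. A minimal sketch with hypothetical 2D centroids:

```r
#Hypothetical 2D class centroids (equal priors and identity covariance assumed)
centroid_low  <- c(-1, 0)
centroid_high <- c(1, 1)
classify <- function(x) {
  d_low  <- sum((x - centroid_low)^2)  #squared distance to the "low" centroid
  d_high <- sum((x - centroid_high)^2) #squared distance to the "high" centroid
  if (d_high < d_low) "high" else "low"
}
classify(c(0.9, 0.5)) #"high": nearer the high centroid
classify(c(-2, 0.2))  #"low"
```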
Accuracylda <- matrix(NA,nrow=10,ncol=1) #create place to store data
colnames(Accuracylda) <-"LDA"#name column
for(i in 1:10){
bootsample <- sample(nrow(druguse_num_s),nrow(druguse_num_s), replace = T)#take sample indices
outofbag <- setdiff(1:nrow(druguse_num_s), bootsample)#create validation data indices
modellda <- lda(UseLevel ~ ., data=druguse_num_s[bootsample,])#Create a linear discriminant analysis classifier using the predictors to predict UseLevel
pplda=predict(modellda, newdata=druguse_num_s[outofbag,])#Create predictions
Accuracylda[i] <- sum(pplda$class==druguse_num_s$UseLevel[outofbag])/length(outofbag)#calculate accuracy
}
summary(Accuracylda)#summarise accuracy
## LDA
## Min. :0.8131
## 1st Qu.:0.8307
## Median :0.8378
## Mean :0.8356
## 3rd Qu.:0.8432
## Max. :0.8487
LDA offers a higher accuracy than naive Bayes. We do have to assume a Gaussian distribution, but our data analysis suggests that most predictors are approximately Gaussian, so a multivariate Gaussian model for each class may be reasonable. The range of accuracies is also narrower.
MDA (mixture discriminant analysis) assumes a Gaussian distribution like LDA, but splits each class into subclasses that are each assumed Gaussian, estimating the parameters of these Gaussians by maximum likelihood. Splitting the classes into subclasses can allow MDA to create better decision boundaries than LDA.
Accuracymda <- matrix(NA, nrow = 10, ncol = 5)#create place to store accuracies
colnames(Accuracymda) <- sprintf("MDA subclasses=%s", 1:5)#name columns according to number of subclasses
for (r in 1:5){#test different numbers of subclasses
for(i in 1:10){
bootsample <- sample(nrow(druguse_num_s),nrow(druguse_num_s), replace = T)#take sample indices
outofbag <- setdiff(1:nrow(druguse_num_s), bootsample)#create validation data indices
modelmda <- mda(UseLevel ~ ., data=druguse_num_s[bootsample,], subclasses = r)#Create a mda classifier to predict UseLevel with r subclasses
pmda <- predict(modelmda, newdata=druguse_num_s[outofbag,])#Create predictions
Accuracymda[i,r] <- sum(pmda==druguse_num_s$UseLevel[outofbag])/length(outofbag)#calculate accuracy
}
}
summary(Accuracymda)#summarise accuracy
## MDA subclasses=1 MDA subclasses=2 MDA subclasses=3 MDA subclasses=4
## Min. :0.8152 Min. :0.8084 Min. :0.8097 Min. :0.8009
## 1st Qu.:0.8216 1st Qu.:0.8236 1st Qu.:0.8237 1st Qu.:0.8130
## Median :0.8315 Median :0.8332 Median :0.8347 Median :0.8348
## Mean :0.8281 Mean :0.8304 Mean :0.8341 Mean :0.8296
## 3rd Qu.:0.8341 3rd Qu.:0.8360 3rd Qu.:0.8474 3rd Qu.:0.8429
## Max. :0.8379 Max. :0.8443 Max. :0.8500 Max. :0.8605
## MDA subclasses=5
## Min. :0.8072
## 1st Qu.:0.8199
## Median :0.8268
## Mean :0.8278
## 3rd Qu.:0.8356
## Max. :0.8447
MDA has a similar average accuracy to LDA but a wider range, suggesting that splitting into subclasses has caused the model to overfit the training data in some cases. Three subclasses seems to provide the best accuracy, but the accuracies are similar in each case so this may not be significant; moreover, the range of accuracies increases with the number of subclasses, which again suggests overfitting.
SVMs find a hyperplane that separates the two classes by maximising the perpendicular distance (the margin) between the hyperplane and the nearest data points of each class. One advantage of SVMs is that you can change the shape of the decision boundary by using a kernel, allowing for non-linear boundaries. Here we test polynomial, radial and sigmoid kernels as well as linear to see if they offer better accuracy.
Accuracysvms <- matrix(NA, nrow = 10, ncol = 4)#allocate memory to store accuracies
kernels <- c("linear", "polynomial", "radial", "sigmoid")#store kernel types so models can be created in for loop
colnames(Accuracysvms) <- sprintf("SVM %s", kernels) #name columns of accuracies matrix
for (r in 1:4){#repeat for each kernel type
for (i in 1:10){#repeat 10 times
bootsample <- sample(nrow(druguse_num_s),nrow(druguse_num_s), replace = T)#take sample indices
outofbag <- setdiff(1:nrow(druguse_num_s), bootsample)#create validation data
modelsvms <- svm(UseLevel~.,data = druguse_num_s[bootsample,], kernel=kernels[r])#create svm with kernel type kernels[r]
psvms <- predict(modelsvms, newdata=druguse_num_s[outofbag,])#predict on validation data
Accuracysvms[i,r] <- sum(psvms==druguse_num_s$UseLevel[outofbag])/length(outofbag)#store accuracy
}
}
summary(Accuracysvms)#show summary of accuracies
## SVM linear SVM polynomial SVM radial SVM sigmoid
## Min. :0.8088 Min. :0.8039 Min. :0.8084 Min. :0.7556
## 1st Qu.:0.8293 1st Qu.:0.8077 1st Qu.:0.8199 1st Qu.:0.7611
## Median :0.8341 Median :0.8138 Median :0.8311 Median :0.7739
## Mean :0.8361 Mean :0.8152 Mean :0.8293 Mean :0.7738
## 3rd Qu.:0.8424 3rd Qu.:0.8197 3rd Qu.:0.8374 3rd Qu.:0.7820
## Max. :0.8728 Max. :0.8302 Max. :0.8502 Max. :0.8000
To compare each kernel we can create a boxplot of the accuracies of each kernel type.
bp.svms <- boxplot(Accuracysvms, horizontal=T, main="SVM kernel accuracies", ylab="Kernel", xlab="Accuracy", col=c(2,5,7,8)) #create boxplot comparing kernels
Of the four kernels, the linear kernel performs best on average, but it offers similar performance to LDA.
We can use a neural network to model the relationship between the predictors and use level. A neural network learns a function relating the predictors to use level by adjusting the weights of each node in the hidden layer at each iteration until it converges to a particular set of weights. We can create a neural network with 10 nodes in the hidden layer and train it using nnet on the training data.
#create validation and training data from scaled data
druguse_train_num_s <- druguse_num_s[1:1400,]
druguse_val_num_s <- druguse_num_s[1401:1885,]
#create a neural network
nnmodel <- nnet(UseLevel~. , data=druguse_train_num_s, size = 10, maxit=1000)
## # weights:  151
## initial value 1004.917054
## iter 10 value 503.141443
## iter 20 value 455.570938
## iter 30 value 416.521479
## iter 40 value 391.345281
## iter 50 value 374.640870
## iter 60 value 349.889873
## iter 70 value 328.538846
## iter 80 value 302.407847
## iter 90 value 280.595727
## iter 100 value 271.184208
## iter 110 value 267.644844
## iter 120 value 267.076157
## iter 130 value 267.026594
## iter 140 value 267.021381
## iter 150 value 267.017322
## iter 160 value 267.016200
## final value 267.016001
## converged
Calculate the accuracy and create a confusion table of the predictions on the validation dataset to evaluate the performance.
#make predictions and create confusion matrix
nnpredn = predict(nnmodel,druguse_val_num_s,type="class")
tablenn <- table(nnpredn, druguse_val_num_s$UseLevel)
#calculate correct and incorrect classifications of each type
TP=tablenn["high","high"] #true positives: predicted high, actually high
FP=tablenn["high","low"] #false positives: predicted high, actually low
TN=tablenn["low","low"] #true negatives: predicted low, actually low
FN=tablenn["low","high"] #false negatives: predicted low, actually high
#calculate accuracy
Accuracynn <- (TP+TN)/(TP+TN+FN+FP)
#print
Accuracynn
## [1] 0.771134
tablenn
##
## nnpredn low high
## high 55 202
## low 172 56
The neural network has a lower accuracy than previous methods. We can plot the model using plotnet to visualise what the network looks like.
#plot neural network
plotnet(nnmodel)
We can vary the number of nodes in the hidden layer, using bootstrapping, to see if we can improve the accuracy on a new dataset.
Accuracynn <- matrix(NA ,nrow=10,ncol=4) #allocate memory to store accuracies
colnames(Accuracynn) <- sprintf("NN size=%s",(1:4)*5)#name columns
for (r in 1:4){#for loop for testing model sizes 5,10,15,20
for (i in 1:10){#for loop for bootstrapping each model
bootsample <- sample(nrow(druguse_num_s),nrow(druguse_num_s), replace = T)#take sample indices
outofbag <- setdiff(1:nrow(druguse_num_s), bootsample)#create validation data indices
#create model
nnmodel <- nnet(UseLevel~. , data=druguse_num_s[bootsample,], size = (r*5), maxit=1000, trace=F) #train model and don't output trace
#predict on validation data and specify it as a classification prediction
nnpredn <- predict(nnmodel, druguse_num_s[outofbag,], type="class")
#calculate accuracy and store it
Accuracynn[i,r] <- sum(nnpredn==druguse_num_s$UseLevel[outofbag])/length(outofbag)#store accuracy
}
}
summary(Accuracynn) #summarise accuracies of each neural network
## NN size=5 NN size=10 NN size=15 NN size=20
## Min. :0.7669 Min. :0.7454 Min. :0.7387 Min. :0.7455
## 1st Qu.:0.7883 1st Qu.:0.7614 1st Qu.:0.7476 1st Qu.:0.7488
## Median :0.7948 Median :0.7668 Median :0.7668 Median :0.7562
## Mean :0.7940 Mean :0.7670 Mean :0.7656 Mean :0.7584
## 3rd Qu.:0.8036 3rd Qu.:0.7708 3rd Qu.:0.7759 3rd Qu.:0.7696
## Max. :0.8122 Max. :0.7917 Max. :0.8100 Max. :0.7721
Summarising the accuracies shows the performance of each model: 5 nodes performs best on a new data set, though somewhat worse than the previous classifiers.
We can use k-nearest neighbours (KNN), a non-parametric method, to predict use level. Given a training set, it classifies a new point according to the classes of the k nearest points in the training set. We can use a for loop to try different values of k and find an optimal one.
druguse_num_s_unsupervised <- druguse_num_s[,-14]#create a dataframe without uselevel
AccuracyKNN <- matrix(NA, nrow = 10, ncol = 50)#allocate memory to store accuracies
colnames(AccuracyKNN) <- sprintf("KNN k=%s",1:50)
for (r in 1:50){#change k for each loop
for (i in 1:10){#repeat 10 times
bootsample <- sample(nrow(druguse_num_s_unsupervised),nrow(druguse_num_s_unsupervised), replace = T)#take sample indices
outofbag <- setdiff(1:nrow(druguse_num_s_unsupervised), bootsample)#create validation data indices
pKNN <- knn(druguse_num_s_unsupervised[bootsample,],druguse_num_s_unsupervised[outofbag,],druguse_num_s[bootsample,]$UseLevel,k=r)#create knn classifier and predictions
AccuracyKNN[i,r] <- sum(pKNN==druguse_num_s[outofbag,]$UseLevel)/length(outofbag)#store accuracy
}
}
summary(AccuracyKNN)#show summary of accuracies
## KNN k=1 KNN k=2 KNN k=3 KNN k=4
## Min. :0.7367 Min. :0.7295 Min. :0.7454 Min. :0.7415
## 1st Qu.:0.7536 1st Qu.:0.7479 1st Qu.:0.7615 1st Qu.:0.7711
## Median :0.7597 Median :0.7516 Median :0.7728 Median :0.7836
## Mean :0.7585 Mean :0.7546 Mean :0.7745 Mean :0.7798
## 3rd Qu.:0.7655 3rd Qu.:0.7649 3rd Qu.:0.7827 3rd Qu.:0.7921
## Max. :0.7697 Max. :0.7846 Max. :0.8247 Max. :0.8055
## KNN k=5 KNN k=6 KNN k=7 KNN k=8
## Min. :0.7649 Min. :0.7862 Min. :0.7829 Min. :0.7939
## 1st Qu.:0.7702 1st Qu.:0.7905 1st Qu.:0.7928 1st Qu.:0.8006
## Median :0.7763 Median :0.7965 Median :0.7947 Median :0.8048
## Mean :0.7819 Mean :0.7981 Mean :0.7960 Mean :0.8046
## 3rd Qu.:0.7909 3rd Qu.:0.8051 3rd Qu.:0.7987 3rd Qu.:0.8094
## Max. :0.8119 Max. :0.8137 Max. :0.8115 Max. :0.8146
## KNN k=9 KNN k=10 KNN k=11 KNN k=12
## Min. :0.7971 Min. :0.7843 Min. :0.7969 Min. :0.8046
## 1st Qu.:0.8067 1st Qu.:0.8129 1st Qu.:0.8081 1st Qu.:0.8071
## Median :0.8195 Median :0.8186 Median :0.8177 Median :0.8114
## Mean :0.8160 Mean :0.8133 Mean :0.8216 Mean :0.8136
## 3rd Qu.:0.8249 3rd Qu.:0.8218 3rd Qu.:0.8377 3rd Qu.:0.8203
## Max. :0.8305 Max. :0.8294 Max. :0.8475 Max. :0.8254
## KNN k=13 KNN k=14 KNN k=15 KNN k=16
## Min. :0.7895 Min. :0.8101 Min. :0.8003 Min. :0.8041
## 1st Qu.:0.8169 1st Qu.:0.8153 1st Qu.:0.8132 1st Qu.:0.8221
## Median :0.8310 Median :0.8177 Median :0.8221 Median :0.8257
## Mean :0.8257 Mean :0.8223 Mean :0.8215 Mean :0.8266
## 3rd Qu.:0.8354 3rd Qu.:0.8270 3rd Qu.:0.8270 3rd Qu.:0.8353
## Max. :0.8460 Max. :0.8448 Max. :0.8446 Max. :0.8470
## KNN k=17 KNN k=18 KNN k=19 KNN k=20
## Min. :0.8067 Min. :0.7942 Min. :0.7986 Min. :0.8077
## 1st Qu.:0.8122 1st Qu.:0.8072 1st Qu.:0.8174 1st Qu.:0.8209
## Median :0.8216 Median :0.8128 Median :0.8280 Median :0.8259
## Mean :0.8198 Mean :0.8153 Mean :0.8271 Mean :0.8243
## 3rd Qu.:0.8253 3rd Qu.:0.8199 3rd Qu.:0.8351 3rd Qu.:0.8286
## Max. :0.8321 Max. :0.8522 Max. :0.8493 Max. :0.8353
## KNN k=21 KNN k=22 KNN k=23 KNN k=24
## Min. :0.8108 Min. :0.8032 Min. :0.7869 Min. :0.8188
## 1st Qu.:0.8200 1st Qu.:0.8158 1st Qu.:0.8073 1st Qu.:0.8241
## Median :0.8251 Median :0.8283 Median :0.8148 Median :0.8312
## Mean :0.8273 Mean :0.8255 Mean :0.8172 Mean :0.8306
## 3rd Qu.:0.8331 3rd Qu.:0.8345 3rd Qu.:0.8308 3rd Qu.:0.8369
## Max. :0.8576 Max. :0.8404 Max. :0.8456 Max. :0.8443
## KNN k=25 KNN k=26 KNN k=27 KNN k=28
## Min. :0.8084 Min. :0.8074 Min. :0.7895 Min. :0.8183
## 1st Qu.:0.8166 1st Qu.:0.8173 1st Qu.:0.8200 1st Qu.:0.8213
## Median :0.8209 Median :0.8248 Median :0.8243 Median :0.8248
## Mean :0.8208 Mean :0.8238 Mean :0.8212 Mean :0.8281
## 3rd Qu.:0.8260 3rd Qu.:0.8277 3rd Qu.:0.8276 3rd Qu.:0.8358
## Max. :0.8314 Max. :0.8393 Max. :0.8387 Max. :0.8419
## KNN k=29 KNN k=30 KNN k=31 KNN k=32
## Min. :0.8081 Min. :0.8082 Min. :0.7988 Min. :0.8029
## 1st Qu.:0.8182 1st Qu.:0.8184 1st Qu.:0.8209 1st Qu.:0.8222
## Median :0.8228 Median :0.8221 Median :0.8346 Median :0.8289
## Mean :0.8255 Mean :0.8257 Mean :0.8317 Mean :0.8271
## 3rd Qu.:0.8320 3rd Qu.:0.8315 3rd Qu.:0.8425 3rd Qu.:0.8312
## Max. :0.8466 Max. :0.8510 Max. :0.8533 Max. :0.8478
## KNN k=33 KNN k=34 KNN k=35 KNN k=36
## Min. :0.8095 Min. :0.8157 Min. :0.8045 Min. :0.8131
## 1st Qu.:0.8182 1st Qu.:0.8278 1st Qu.:0.8152 1st Qu.:0.8202
## Median :0.8202 Median :0.8301 Median :0.8312 Median :0.8239
## Mean :0.8238 Mean :0.8326 Mean :0.8264 Mean :0.8270
## 3rd Qu.:0.8324 3rd Qu.:0.8384 3rd Qu.:0.8372 3rd Qu.:0.8281
## Max. :0.8434 Max. :0.8540 Max. :0.8410 Max. :0.8553
## KNN k=37 KNN k=38 KNN k=39 KNN k=40
## Min. :0.8061 Min. :0.7951 Min. :0.8172 Min. :0.7997
## 1st Qu.:0.8177 1st Qu.:0.8151 1st Qu.:0.8253 1st Qu.:0.8198
## Median :0.8244 Median :0.8288 Median :0.8331 Median :0.8289
## Mean :0.8261 Mean :0.8269 Mean :0.8318 Mean :0.8264
## 3rd Qu.:0.8343 3rd Qu.:0.8415 3rd Qu.:0.8359 3rd Qu.:0.8340
## Max. :0.8543 Max. :0.8484 Max. :0.8484 Max. :0.8475
## KNN k=41 KNN k=42 KNN k=43 KNN k=44
## Min. :0.8200 Min. :0.8026 Min. :0.8017 Min. :0.8194
## 1st Qu.:0.8327 1st Qu.:0.8161 1st Qu.:0.8159 1st Qu.:0.8281
## Median :0.8418 Median :0.8179 Median :0.8248 Median :0.8349
## Mean :0.8403 Mean :0.8249 Mean :0.8248 Mean :0.8352
## 3rd Qu.:0.8471 3rd Qu.:0.8372 3rd Qu.:0.8341 3rd Qu.:0.8394
## Max. :0.8600 Max. :0.8459 Max. :0.8478 Max. :0.8599
## KNN k=45 KNN k=46 KNN k=47 KNN k=48
## Min. :0.8195 Min. :0.8095 Min. :0.8244 Min. :0.8088
## 1st Qu.:0.8261 1st Qu.:0.8189 1st Qu.:0.8294 1st Qu.:0.8168
## Median :0.8339 Median :0.8207 Median :0.8348 Median :0.8286
## Mean :0.8331 Mean :0.8207 Mean :0.8360 Mean :0.8274
## 3rd Qu.:0.8408 3rd Qu.:0.8248 3rd Qu.:0.8399 3rd Qu.:0.8350
## Max. :0.8437 Max. :0.8304 Max. :0.8513 Max. :0.8516
## KNN k=49 KNN k=50
## Min. :0.8078 Min. :0.8131
## 1st Qu.:0.8211 1st Qu.:0.8182
## Median :0.8293 Median :0.8297
## Mean :0.8293 Mean :0.8290
## 3rd Qu.:0.8405 3rd Qu.:0.8360
## Max. :0.8447 Max. :0.8495
We can plot the mean accuracy and the range of each k to see how accuracy varies as we change the number of nearest neighbours.
AccuracyKNNdf <- data.frame(K=1:50, Mean=apply(AccuracyKNN,2,mean), Minimum=apply(AccuracyKNN,2,min), Maximum=apply(AccuracyKNN,2,max))#put accuracy statistics in a dataframe
ggplot(data = AccuracyKNNdf, aes(x=K, y=Mean, ymin=Minimum, ymax=Maximum)) + geom_pointrange() + geom_smooth(method='loess') + ggtitle("KNN accuracy for K = 1 to 50") + ylab("Accuracy") #plot the mean, maximum and minimum accuracy for each k with a line of best fit
At about k=20-25 the accuracy seems to plateau, suggesting that above this point we risk overfitting the data. The accuracy is similar to that of previous methods.
We can combine the methods above by taking the modal prediction of the classifiers, with equal weights, in the hope of improving accuracy and reducing the range of accuracies on different data sets. As the previous methods had similar accuracies, this may give a better overall accuracy: even though their accuracies are similar, they may not misclassify the same data points.
Accuracyen <- matrix(NA, nrow = 10, ncol = 1)#allocate memory to store accuracies
colnames(Accuracyen) <- "Ensemble" #name columns
#find mode function
getmodalvalue <- function(x) {
u <- unique(x)#find unique values of x
u[which.max(tabulate(match(x, u)))]#find modal value of x
}
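As a quick sanity check of this helper (a hypothetical example, not part of the analysis), we can apply it to a small character vector; the definition is repeated so the snippet is self-contained:

```r
# Most frequent element of a vector; ties go to the value that appears first
getmodalvalue <- function(x) {
  u <- unique(x)
  u[which.max(tabulate(match(x, u)))]
}
getmodalvalue(c("high", "low", "high", "low", "high"))  # "high"
getmodalvalue(c("low", "high"))                         # tie, so "low" (seen first)
```

The tie-breaking behaviour matters here because with seven classifiers voting on two classes a tie cannot occur, but with an even number of models it silently favours the first model's prediction.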
for (i in 1:10){#repeat 10 times for bootstrapping
bootsample <- sample(nrow(druguse_num_s_unsupervised),nrow(druguse_num_s_unsupervised), replace = T)#take sample indices
outofbag <- setdiff(1:nrow(druguse_num_s_unsupervised), bootsample)#create validation data indices
#create models
models <- list(glm(UseLevel ~ ., family=binomial(link='logit'), data=druguse_num_s[bootsample,]),
naiveBayes(UseLevel ~ ., data=druguse_num_s[bootsample,]),
lda(UseLevel ~ ., data=druguse_num_s[bootsample,]),
mda(UseLevel ~ ., data=druguse_num_s[bootsample,], subclasses = 3),
svm(UseLevel~.,data = druguse_num_s[bootsample,], kernel="linear"))
predictions <- lapply(models,predict,newdata=druguse_num_s[outofbag,])#create predictions
predictions[[1]] <- ifelse(predictions[[1]]>0.5, 'high','low')#set cut off for glm
predictions[[3]] <- predictions[[3]]$class#extract class from lda predictions
predictions[[6]] <- knn(druguse_num_s_unsupervised[bootsample,],druguse_num_s_unsupervised[outofbag,],druguse_num_s[bootsample,]$UseLevel,k=25)#create knn classifier and predictions
nnmodel <- nnet(UseLevel~. , data=druguse_num_s[bootsample,], size = 5, maxit=1000, trace=F)#create neural network with no trace
predictions[[7]] <- predict(nnmodel, druguse_num_s[outofbag,], type="class") #add neural network predictions
predictions <- apply(data.frame(predictions),2,as.character)#set all predictions elements to character type using apply
finalpred <- apply(predictions,1,getmodalvalue)#find mode of each row of the predictions from each model
Accuracyen[i] <- sum(finalpred==druguse_num_s$UseLevel[outofbag])/length(outofbag)#store accuracy
}
summary(Accuracyen)#summarise accuracy of equally weighted ensemble
## Ensemble
## Min. :0.8166
## 1st Qu.:0.8262
## Median :0.8333
## Mean :0.8331
## 3rd Qu.:0.8377
## Max. :0.8542
We then use summary to show the performance of our ensemble on a new data set. The performance is similar to some of the techniques on their own. Changing the weight given to each model's prediction could improve this. We could try different weights ourselves, but this would take a long time, so instead we can use a neural network to learn the weight of each model in the ensemble; it can find appropriate weights faster than a for loop searching over weight values.
classifiers <- sprintf("%s",1:7)#create colnames for 2nd nnet input
Accuracyennn <- matrix(NA, nrow=10, ncol=4)#create matrix to store accuracies
colnames(Accuracyennn) <- sprintf("Nodes %s", (1:4)*5)#create names for accuracy matrix columns
for (r in 1:4){#try different nn sizes 5,10,15,20
for (i in 1:10){#repeat 10 times
bootsample <- sample(nrow(druguse_num_s_unsupervised),nrow(druguse_num_s_unsupervised), replace = T)#take sample indices
outofbag <- setdiff(1:nrow(druguse_num_s_unsupervised), bootsample)#create validation data indices
#create models
models <- list(glm(UseLevel ~ ., family=binomial(link='logit'), data=druguse_num_s[bootsample,]),
naiveBayes(UseLevel ~ ., data=druguse_num_s[bootsample,]),
lda(UseLevel ~ ., data=druguse_num_s[bootsample,]),
mda(UseLevel ~ ., data=druguse_num_s[bootsample,], subclasses = 3),
svm(UseLevel~.,data = druguse_num_s[bootsample,], kernel="linear"))
#predict using models
predictions <- lapply(models,predict,newdata=druguse_num_s[outofbag,])
predictions[[1]] <- ifelse(predictions[[1]]>0.5, 'high','low')#set cut off for glm
predictions[[3]] <- predictions[[3]]$class#extract class for lda
predictions[[6]] <- knn(druguse_num_s_unsupervised[bootsample,],druguse_num_s_unsupervised[outofbag,],druguse_num_s[bootsample,]$UseLevel,k=25)#create knn classifier and predictions
nnmodel1 <- nnet(UseLevel~. , data=druguse_num_s[bootsample,], size = 5, maxit=1000, trace=F)#make nnet with no trace
predictions[[7]] <- predict(nnmodel1, druguse_num_s[outofbag,], type="class") #predict using nnet
predictions <- data.frame(apply(data.frame(predictions),2,as.character)) #make each element of character type using apply
colnames(predictions) <- classifiers #assign names to columns for next nnet
predictions$UseLevel <- druguse$UseLevel[outofbag] #add true uselevels as column to predictions data frame
nnmodel2 <- nnet(UseLevel~.,data=predictions, size=(r*5), maxit=1000, trace=F) #make 2nd neural network using predictions from all models with no trace
#make new samples
bootsample <- sample(nrow(druguse_num_s_unsupervised),nrow(druguse_num_s_unsupervised), replace = T)#take sample indices
outofbag <- setdiff(1:nrow(druguse_num_s_unsupervised), bootsample)#create validation data indices
#create predictions on new indices using models made previously
predictions <- lapply(models,predict,newdata=druguse_num_s[outofbag,])
predictions[[1]] <- ifelse(predictions[[1]]>0.5, 'high','low')
predictions[[3]] <- predictions[[3]]$class
predictions[[6]] <- knn(druguse_num_s_unsupervised[bootsample,],druguse_num_s_unsupervised[outofbag,],druguse_num_s[bootsample,]$UseLevel,k=25)#create knn classifier and predictions
predictions[[7]] <- predict(nnmodel1, druguse_num_s[outofbag,], type="class") #make predictions using neural net
predictions <- data.frame(apply(data.frame(predictions),2,as.character)) #make all elements of character type
colnames(predictions) <- classifiers #assign same names to columns
predictions$UseLevel <- druguse$UseLevel[outofbag] #add true use level
finalpred <- predict(nnmodel2,predictions,type='class') #make final predictions using 2nd nnet and new predictions
Accuracyennn[i,r] <- sum(finalpred==druguse_num_s$UseLevel[outofbag])/length(outofbag)#store accuracy
}
}
summary(Accuracyennn)#summarise accuracies
## Nodes 5 Nodes 10 Nodes 15 Nodes 20
## Min. :0.8182 Min. :0.8291 Min. :0.8104 Min. :0.8209
## 1st Qu.:0.8330 1st Qu.:0.8403 1st Qu.:0.8376 1st Qu.:0.8363
## Median :0.8389 Median :0.8468 Median :0.8433 Median :0.8408
## Mean :0.8359 Mean :0.8475 Mean :0.8403 Mean :0.8416
## 3rd Qu.:0.8394 3rd Qu.:0.8570 3rd Qu.:0.8492 3rd Qu.:0.8438
## Max. :0.8450 Max. :0.8702 Max. :0.8668 Max. :0.8681
Summarising the accuracies shows how the classifier performs. This technique with 10 nodes seems to perform best: the neural network adjusts the weight given to each classifier based on its relative performance, and we see a slight improvement over the methods on their own.
To estimate how each model performs on a new data set I have used bootstrapping, where the training set is sampled randomly with replacement from the whole data set and the remaining rows are used as a validation set. To compare the classifiers and the ensemble techniques we can plot a boxplot, showing only the best-performing variant of each technique where multiple parameters were tested.
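As an aside on this bootstrap scheme: each of the n rows is left out of all n draws with probability (1 - 1/n)^n ≈ 1/e, so roughly 36.8% of rows end up out-of-bag on average and each validation set covers about a third of the data. A minimal sketch (the seed is arbitrary, for reproducibility only):

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
n <- 1885
bootsample <- sample(n, n, replace = TRUE)  # bootstrap training indices
outofbag <- setdiff(1:n, bootsample)        # rows never drawn form the validation set
length(outofbag) / n                        # close to 1 - 1/e, i.e. about 0.368
```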
comparison <- data.frame("GLM"=Accuracy, "Naive Bayes"=Accuracybayes, "LDA"=Accuracylda,"MDA"=Accuracymda[,3],"SVM"=Accuracysvms[,1],"NN"=Accuracynn[,1], "KNN"=AccuracyKNN[,25],"Ensemble"=Accuracyen,"Ensemble NN"=Accuracyennn[,2])#create data frame of accuracies of the models with their best parameters
melted_comparison <- melt(comparison)#reshape to make it easier to plot
## No id variables; using all as measure variables
ggplot(melted_comparison, aes(x=variable, y=value, fill=variable)) + geom_boxplot(colour=rainbow(9)) + coord_flip() + theme(legend.position="none") + xlab("Classifier") + ylab("Accuracy") + ggtitle("Comparison of classification techniques for use level") #create horizontal boxplot of the accuracies of all techniques
The ensemble neural net technique performs best: some classifiers do better in certain cases, and the neural net learns the combination of the classifiers' predictions that gives the highest accuracy, which it has done successfully, although the logistic regression classifier performs similarly. A predicted average accuracy of about 85% on a new data set is high.
Add a new column of factors for whether the individual has ever used heroin.
#create used heroin column and assign yes or no
druguse$usedheroin[druguse$heroin==0] <- 'no'
druguse$usedheroin[druguse$heroin!=0] <- 'yes'
druguse$usedheroin <- as.factor(druguse$usedheroin) #change from character to factor type
str(druguse) #check column has been added to the dataframe and is of correct type
## 'data.frame': 1885 obs. of 34 variables:
## $ agegroup : Factor w/ 6 levels "18-24","25-34",..: 1 1 2 2 3 1 1 2 3 1 ...
## $ gender : Factor w/ 2 levels "female","male": 1 2 1 2 1 1 2 1 1 2 ...
## $ education : num 0.455 -0.611 1.164 1.164 0.455 ...
## $ country : Factor w/ 7 levels "Australia","Canada",..: 6 7 6 6 6 7 7 6 6 6 ...
## $ ethnicity : Factor w/ 7 levels "Asian","Black",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ neuroticism : num -0.1488 -0.921 1.0212 -1.5508 -0.0519 ...
## $ extraversion : num -0.948 1.741 0.962 1.114 0.962 ...
## $ opentoexperience : num -1.4242 0.5833 -0.9763 -0.0193 -1.119 ...
## $ agreeableness : num -0.4532 -1.3429 -0.6063 -0.0173 0.9416 ...
## $ conscientiousness: num 0.26 1.134 -0.143 1.462 1.631 ...
## $ impulsiveness : num 0.53 0.53 -1.38 0.193 -1.38 ...
## $ sensation : num -0.8464 1.2247 -1.1808 0.0799 -2.0785 ...
## $ caffeine : num 4 6 6 6 5 6 5 6 6 5 ...
## $ chocolate : num 5 5 6 5 6 6 4 6 6 6 ...
## $ nicotine : num 2 6 5 4 0 6 4 3 0 4 ...
## $ alcohol : num 4 4 5 5 3 5 3 3 5 5 ...
## $ amphetamine : num 0 3 0 0 0 0 2 0 0 4 ...
## $ amylnitrite : num 0 0 2 0 0 0 3 0 0 2 ...
## $ benzodiaz : num 0 2 0 0 0 0 3 0 0 3 ...
## $ cannabis : num 2 5 3 2 0 6 5 0 1 4 ...
## $ cocaine : num 0 3 2 0 0 0 2 0 0 2 ...
## $ crack : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ecstasy : num 0 0 0 0 0 0 3 0 0 4 ...
## $ heroin : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ketamine : num 0 0 0 0 0 0 0 0 0 4 ...
## $ legalhighs : num 0 3 2 4 0 5 3 0 0 3 ...
## $ LSD : num 0 0 0 0 0 0 3 0 0 2 ...
## $ methadone : num 0 0 0 0 0 2 4 0 0 0 ...
## $ mushrooms : num 0 3 0 0 0 3 3 0 0 2 ...
## $ volatiles : num 0 0 0 0 0 0 0 0 0 0 ...
## $ any : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 1 2 2 2 2 2 ...
## $ severity : num 4 25 14 10 0 22 35 3 1 34 ...
## $ UseLevel : Factor w/ 2 levels "low","high": 1 2 2 1 1 2 2 1 1 2 ...
## $ usedheroin : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Create the training and test data sets including the new used heroin column and other relevant columns.
druguseheroin_predictors <- druguse[c(1:23,25:30,34)]#create dataframe only of predictors removing unwanted columns
druguseheroin_train <- druguseheroin_predictors[1:1400,] #create training data by removing rows of test data
druguseheroin_test <- druguseheroin_predictors[1401:1885,]#create test data from remaining rows
Create a random forest, a confusion table and calculate the accuracy.
rf.heroin <- randomForest(usedheroin ~ ., data=druguseheroin_train) #create a random forest
prf=predict(rf.heroin, newdata=druguseheroin_test)#Create predictions
tablerfh <- table(prf,druguseheroin_test$usedheroin) #create confusion table
#calculate incorrect and correct classifications of each type
TP=tablerfh["yes","yes"] #true positives: predicted yes, actually yes
FP=tablerfh["yes","no"] #false positives: predicted yes, actually no
TN=tablerfh["no","no"] #true negatives: predicted no, actually no
FN=tablerfh["no","yes"] #false negatives: predicted no, actually yes
Accuracyrfh <- (TP+TN)/(TP+TN+FN+FP)#calculate accuracy
tablerfh #output table
##
## prf no yes
## no 399 35
## yes 6 45
Accuracyrfh #output accuracy
## [1] 0.9154639
The random forest performs well; however, the data set contains few people who have used heroin, so a classifier can score well simply by being heavily biased towards predicting no. This is why the confusion table shows more false negatives than false positives, suggesting the classifier is doing this to some extent. The counts of yes and no are shown below to illustrate this.
x <- druguse %>% group_by(usedheroin) %>% tally() #tally the number of people who have and haven't used heroin
x #print tibble
## # A tibble: 2 x 2
## usedheroin n
## <fct> <int>
## 1 no 1605
## 2 yes 280
x[1,2]/nrow(druguse) #show the accuracy obtained by always predicting no
## n
## 1 0.8514589
As shown above, 85% of the data set is no, so a classifier that always predicts no can be expected to achieve around 85% accuracy. The random forest therefore performs only somewhat better than this baseline.
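To make the bias concrete, we can compute sensitivity and specificity from the confusion table above instead of overall accuracy (a sketch with the counts from tablerfh hard-coded so it is self-contained):

```r
# Counts taken from the random forest confusion table above
TP <- 45   # predicted yes, actually yes
FN <- 35   # predicted no, actually yes
TN <- 399  # predicted no, actually no
FP <- 6    # predicted yes, actually no
sensitivity <- TP / (TP + FN)  # 45/80 = 0.5625: barely over half of heroin users detected
specificity <- TN / (TN + FP)  # 399/405, about 0.985
```

The low sensitivity confirms that most of the classifier's accuracy comes from the majority "no" class.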
Predict severity based on the first 14 predictors and cannabis use, but no other illegal substances, and without the non-significant predictors identified in the data analysis (alcohol, extraversion and country). We use scaled data, as this will likely improve the performance of our regression models, and create a data set of the predictors and severity. As severity is not a class, we can only use regression techniques, and we compare the regressors by their mean squared error (MSE). We can evaluate each technique and, if it performs well, use it in an ensemble as in question 3 to see if we can lower the MSE.
druguse_pred_sev <- druguse_num_s_unsupervised #without variables with high p value
druguse_pred_sev$cannabis <- scale(druguse$cannabis) #add scaled cannabis usage
druguse_pred_sev$severity <- scale(druguse$severity) #add scaled severity
Lasso and ridge regression are penalised linear regression techniques: they find a set of coefficients for the predictors subject to a penalty on large coefficient values. This makes them less susceptible to data sets with extreme values; they are shrinkage methods. This could work well with our data.
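For reference, the elastic net objective that glmnet minimises (assuming its standard formulation) interpolates between ridge and lasso through alpha; alpha = 0 gives pure ridge, alpha = 1 pure lasso, and cv.glmnet chooses lambda by cross-validation:

```latex
\min_{\beta_0,\beta}\;\frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2
+\lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1\right]
```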
mselasso <- matrix(NA,nrow = 10,ncol = 21)#create array to store mse
colnames(mselasso) <- sprintf("Lasso/Ridge alpha=%s",(1:21-1)/20)#name columns
for (r in 1:21){#alpha = 0,0.05,0.1....1 with alpha=0 being pure ridge and alpha=1 pure lasso
for (i in 1:10){#repeat 10 times
bootsample <- sample(nrow(druguse_pred_sev),nrow(druguse_pred_sev), replace = T)#take sample indices
outofbag <- setdiff(1:nrow(druguse_pred_sev), bootsample)#create validation data indices
#create input for model
x <- model.matrix(severity~., druguse_pred_sev[bootsample,])#train x
y <- druguse_pred_sev$severity[bootsample,] #train y
modellasso <- cv.glmnet(x,y, alpha = (r-1)/20)#create lasso/ridge with alpha=0 to 1 and find optimal lambda
plas <- predict(modellasso, model.matrix(severity~., druguse_pred_sev[outofbag,]), s = 'lambda.min')#predict on validation data, building the design matrix the same way as for training
mselasso[i,r] <- mean((plas - druguse_pred_sev$severity[outofbag])^2)#calculate mean squared error and store
}
}
summary(mselasso)#summarise mse
## Lasso/Ridge alpha=0 Lasso/Ridge alpha=0.05 Lasso/Ridge alpha=0.1
## Min. :0.2213 Min. :0.2029 Min. :0.1852
## 1st Qu.:0.2560 1st Qu.:0.2247 1st Qu.:0.1998
## Median :0.2678 Median :0.2278 Median :0.2280
## Mean :0.2658 Mean :0.2300 Mean :0.2255
## 3rd Qu.:0.2823 3rd Qu.:0.2377 3rd Qu.:0.2507
## Max. :0.2901 Max. :0.2538 Max. :0.2595
## Lasso/Ridge alpha=0.15 Lasso/Ridge alpha=0.2 Lasso/Ridge alpha=0.25
## Min. :0.1983 Min. :0.1787 Min. :0.1925
## 1st Qu.:0.2106 1st Qu.:0.2162 1st Qu.:0.2097
## Median :0.2203 Median :0.2255 Median :0.2228
## Mean :0.2245 Mean :0.2288 Mean :0.2215
## 3rd Qu.:0.2389 3rd Qu.:0.2419 3rd Qu.:0.2310
## Max. :0.2529 Max. :0.2753 Max. :0.2463
## Lasso/Ridge alpha=0.3 Lasso/Ridge alpha=0.35 Lasso/Ridge alpha=0.4
## Min. :0.1905 Min. :0.1722 Min. :0.1937
## 1st Qu.:0.2249 1st Qu.:0.1992 1st Qu.:0.2125
## Median :0.2368 Median :0.2143 Median :0.2350
## Mean :0.2389 Mean :0.2206 Mean :0.2384
## 3rd Qu.:0.2547 3rd Qu.:0.2389 3rd Qu.:0.2583
## Max. :0.3005 Max. :0.2875 Max. :0.2914
## Lasso/Ridge alpha=0.45 Lasso/Ridge alpha=0.5 Lasso/Ridge alpha=0.55
## Min. :0.1898 Min. :0.1838 Min. :0.1712
## 1st Qu.:0.2107 1st Qu.:0.1991 1st Qu.:0.2081
## Median :0.2309 Median :0.2091 Median :0.2202
## Mean :0.2249 Mean :0.2123 Mean :0.2258
## 3rd Qu.:0.2427 3rd Qu.:0.2222 3rd Qu.:0.2497
## Max. :0.2489 Max. :0.2498 Max. :0.2738
## Lasso/Ridge alpha=0.6 Lasso/Ridge alpha=0.65 Lasso/Ridge alpha=0.7
## Min. :0.2120 Min. :0.1804 Min. :0.1778
## 1st Qu.:0.2325 1st Qu.:0.2012 1st Qu.:0.2074
## Median :0.2384 Median :0.2064 Median :0.2254
## Mean :0.2400 Mean :0.2167 Mean :0.2192
## 3rd Qu.:0.2522 3rd Qu.:0.2318 3rd Qu.:0.2306
## Max. :0.2642 Max. :0.2614 Max. :0.2451
## Lasso/Ridge alpha=0.75 Lasso/Ridge alpha=0.8 Lasso/Ridge alpha=0.85
## Min. :0.1945 Min. :0.1646 Min. :0.1595
## 1st Qu.:0.2193 1st Qu.:0.2106 1st Qu.:0.2084
## Median :0.2437 Median :0.2249 Median :0.2167
## Mean :0.2354 Mean :0.2232 Mean :0.2165
## 3rd Qu.:0.2516 3rd Qu.:0.2447 3rd Qu.:0.2394
## Max. :0.2659 Max. :0.2507 Max. :0.2693
## Lasso/Ridge alpha=0.9 Lasso/Ridge alpha=0.95 Lasso/Ridge alpha=1
## Min. :0.1640 Min. :0.1761 Min. :0.1823
## 1st Qu.:0.1923 1st Qu.:0.2095 1st Qu.:0.2123
## Median :0.2229 Median :0.2129 Median :0.2273
## Mean :0.2150 Mean :0.2110 Mean :0.2251
## 3rd Qu.:0.2384 3rd Qu.:0.2200 3rd Qu.:0.2410
## Max. :0.2495 Max. :0.2301 Max. :0.2572
We can plot alpha against the average mean squared error to see the optimal alpha.
#create dataframe from data for ggplot
lasso <- data.frame("alpha"=(1:21 - 1)/20, "avg mse"=apply(mselasso,2,mean))
#plot avg mse and alpha in ggplot
ggplot(data = lasso, aes(x = alpha, y = avg.mse)) + geom_smooth() + geom_point() + ggtitle("Average MSE for alpha")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The plot suggests better performance as alpha approaches 1, i.e. lasso rather than ridge, although the difference is small.
We can apply a generalised linear model again, this time with a Gaussian family, as this is a regression problem and in our data analysis the data appeared to be Gaussian distributed. We also apply a basis transformation to the model to include products of pairs of predictors, which may improve accuracy at the risk of overfitting.
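The `severity~.^2` formula used below expands to all main effects plus all pairwise interaction terms; a minimal sketch with hypothetical predictors a, b and c:

```r
# ~ .^2 adds every pairwise product (a:b, a:c, b:c) to the main effects
toy <- data.frame(a = rnorm(5), b = rnorm(5), c = rnorm(5))
colnames(model.matrix(~ .^2, toy))
# "(Intercept)" "a" "b" "c" "a:b" "a:c" "b:c"
```

With 15 predictors this produces over a hundred terms, which is why overfitting is a risk.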
mseglm <- matrix(NA, nrow = 10, ncol = 1)#create place to store mse
colnames(mseglm) <- "GLM"#name column
for (i in 1:10){#repeat 10 times
bootsample <- sample(nrow(druguse_pred_sev),nrow(druguse_pred_sev), replace = T)#take sample indices
outofbag <- setdiff(1:nrow(druguse_pred_sev), bootsample)#create validation data indices
modelglm <- glm(severity~.^2, data = druguse_pred_sev[bootsample,], family = gaussian(link = "identity")) #create glm model using the predictors and their pairwise products, fitted on the bootstrap sample
pglm <- predict(modelglm, druguse_pred_sev[outofbag, ])#predict on validation data
mseglm[i] <- mean((pglm - druguse_pred_sev$severity[outofbag,])^2) #calculate mean squared error
}
summary(mseglm)#show summary of mse
## GLM
## Min. :0.2711
## 1st Qu.:0.2763
## Median :0.2888
## Mean :0.2892
## 3rd Qu.:0.2975
## Max. :0.3202
The GLM performs worse than lasso/ridge regression. This could be due to extreme values, which affect an unpenalised method such as a GLM more than they affect lasso/ridge regression.
MARS uses regression splines that split the range of each predictor into sections, fitting a linear model to each. It does this adaptively over multiple variables: going forwards, it repeatedly adds the pair of basis functions that produces the lowest residual sum of squares, testing every combination; it then goes backwards, removing the least effective terms until it finds the best model. We can create a model as below and summarise the errors. This could let us further remove unhelpful predictors, and because the method uses splines it produces a piecewise-linear model, which may be a better regressor than a fully linear one.
msemars <- matrix(NA,nrow = 10, ncol = 1)#create place to store mse
colnames(msemars) <- "MARS"#name column
for (i in 1:10){#repeat 10 times
bootsample <- sample(nrow(druguse_pred_sev),nrow(druguse_pred_sev), replace = T)#take sample indices
outofbag <- setdiff(1:nrow(druguse_pred_sev), bootsample)#create validation data indices
x <- druguse_pred_sev[bootsample, -15]
y <- druguse_pred_sev$severity[bootsample]
modelmars <- mars(x, y, forward.step = T, prune = T, degree = 1) #create linear mars model
pmars <- predict(modelmars, druguse_pred_sev[outofbag,-15])#predict on validation data
msemars[i] <- mean((pmars - druguse_pred_sev$severity[outofbag])^2) #calculate mean squared error
}
summary(msemars)#show summary of mse
## MARS
## Min. :0.2844
## 1st Qu.:0.3110
## Median :0.3333
## Mean :0.3254
## 3rd Qu.:0.3372
## Max. :0.3522
MARS offers higher MSE than the previous techniques.
We can plot the model's fitted cuts over each predictor's range as below. Straight lines indicate predictors that the model considers not significant after its forward step and backwards pruning of predictors.
par(mfrow = c(4, 4))#create 4x4 grid in plotting device
for (i in 1:14)#plot all predictors
{
xp <- matrix(sapply(druguse_pred_sev[1:14], mean), nrow(druguse_pred_sev), ncol(druguse_pred_sev) - 1, byrow = TRUE)#create matrix of means
xr <- sapply(druguse_pred_sev, range)#find range of each predictor
xp[, i] <- seq(xr[1, i], xr[2, i], len=nrow(druguse_pred_sev))#create sequence of the length of the rows of df across the range of each predictor
xf <- predict(modelmars, xp)#predict on the sequence
plot(xp[, i], xf, xlab = names(druguse_pred_sev)[i], ylab = "", type = "l");#plot predictions against range of each predictor
}
We can again use a neural network, testing different sizes, to predict severity. Neural nets approximate the function relating the predictors to the value being predicted, so they could be a good technique for regression. We can test different sizes to find the optimal one.
msenn <- matrix(NA,nrow=10,ncol=4) #allocate memory to store accuracies
colnames(msenn) <- sprintf("NN nodes=%s", (1:4)*5)
for (r in 1:4){#for loop for testing model sizes 5,10,15,20
for (i in 1:10){#for loop for cross validating each model
bootsample <- sample(nrow(druguse_pred_sev),nrow(druguse_pred_sev), replace = T)#take bootstrap sample indices
outofbag <- setdiff(1:nrow(druguse_pred_sev), bootsample)#indices of out-of-bag validation data
nnmodel <- nnet(severity~. , data=druguse_pred_sev[bootsample,], size = (r*5), maxit=1000, linout=T, trace=F) #create nnet with linear output
#predict on validation data
nnpredn = predict(nnmodel,druguse_pred_sev[outofbag,])
#calculate accuracy and store it
msenn[i,r] <- mean((nnpredn - druguse_pred_sev$severity[outofbag])^2)
}
}
summary(msenn)#summarise mse
The summary shows the MSE of the neural nets. Their performance is poor, with a wide range caused by extreme outliers, suggesting a neural net is not a good model on its own in this case.
We can feed the models, excluding the neural net since its performance is poor, into an ensemble technique by taking the mean of the regressors' predictions rather than the mode as in the classification case. This will hopefully lower the MSE.
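The averaging step itself reduces to a row-wise mean over the per-model predictions. A minimal sketch with made-up prediction values (not from the fitted models):

```r
# Hypothetical predictions from three regressors for four observations
predictions <- data.frame(
  ridge = c(1.0, 2.0, 3.0, 4.0),
  lasso = c(1.2, 1.8, 3.1, 3.9),
  mars  = c(0.8, 2.2, 2.9, 4.1)
)

# Ensemble prediction: the mean of the individual model outputs per row
# (equivalent to apply(predictions, 1, mean) used below)
finalpred <- rowMeans(predictions)
```

Averaging tends to cancel the uncorrelated errors of the individual regressors, which is why the ensemble can beat each of its members.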
mseen <- matrix(NA, nrow=10, ncol=1)#create matrix to store accuracies
colnames(mseen) <- "Ensemble"#create names for accuracy matrix
for (i in 1:10){#repeat 10 times
bootsample <- sample(nrow(druguse_pred_sev),nrow(druguse_pred_sev), replace = T)#take bootstrap sample indices
outofbag <- setdiff(1:nrow(druguse_pred_sev), bootsample)#indices of out-of-bag validation data
#create inputs for ridge and lasso
x <- model.matrix(severity~., druguse_pred_sev[bootsample,])
y <- druguse_pred_sev$severity[bootsample]
modelridge <- cv.glmnet(x,y, alpha = 0)#create ridge model
ridgepred <- predict(modelridge, model.matrix(severity~., druguse_pred_sev[outofbag,]), s = 'lambda.min')#predict on validation data encoded with the same model.matrix as the training x
modellasso <- cv.glmnet(x,y, alpha = 1)#create lasso model
lassopred <- predict(modellasso, model.matrix(severity~., druguse_pred_sev[outofbag,]), s = 'lambda.min')#predict on validation data
modelglm <- glm(severity~.^2, data = druguse_pred_sev[bootsample,], family = gaussian(link = "identity")) #create glm on the bootstrap sample so the out-of-bag rows stay unseen
pglm <- predict(modelglm, druguse_pred_sev[outofbag, ])#predict on validation data
#create input for mars
x <- druguse_pred_sev[bootsample, -15]
y <- druguse_pred_sev$severity[bootsample]
modelmars <- mars(x, y, forward.step = T, prune = T, degree = 1) #create mars model
pmars <- predict(modelmars, druguse_pred_sev[outofbag,-15])#predict on validation data
predictions <- data.frame("ridge"=ridgepred, "lasso"=lassopred, "glm"=pglm, "mars"=pmars)#create data frame of predictions
finalpred <- apply(predictions,1,mean) #make final predictions by averaging predictions of each model
mseen[i] <- mean((finalpred - druguse_pred_sev$severity[outofbag])^2) #calculate mse
}
summary(mseen)#summarise mse
##     Ensemble
## Min. :0.2110
## 1st Qu.:0.2145
## Median :0.2182
## Mean :0.2189
## 3rd Qu.:0.2214
## Max. :0.2344
This offers an improvement over the individual regression techniques, with a small range.
As before, we can use a neural net instead of averaging to see whether it can find a better way of combining the models than the mean, since different regressors predict better in different cases.
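This is a form of stacked generalisation: the base models' predictions become the inputs of a second-stage model. A minimal sketch of the idea on synthetic data, with a plain `lm` standing in for the nnet used below (any second-stage regressor works the same way):

```r
set.seed(1)
n <- 100
truth <- rnorm(n)  # synthetic target

# Hypothetical base-model predictions: each noisy, with different biases
base_preds <- data.frame(
  ridge = truth + rnorm(n, sd = 0.3),
  lasso = truth + rnorm(n, sd = 0.3) + 0.2,
  mars  = truth + rnorm(n, sd = 0.3) - 0.1
)

# Stage two: learn how to weight the base predictions
stack <- lm(truth ~ ., data = cbind(base_preds, truth = truth))

# In-sample, the learned combination can do no worse than a plain average,
# since the average is one of the linear combinations lm searches over
mse_stack <- mean(residuals(stack)^2)
mse_mean  <- mean((rowMeans(base_preds) - truth)^2)
```

The code below swaps the `lm` for a nnet, which can additionally learn non-linear combinations of the base predictions.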
mseennn <- matrix(NA, nrow=10, ncol=4)#create matrix to store accuracies
colnames(mseennn) <- sprintf("Ensemble NN nodes=%s", (1:4)*5)#create names for accuracy matrix
for (r in 1:4){
for (i in 1:10){#repeat 10 times
bootsample <- sample(nrow(druguse_pred_sev),nrow(druguse_pred_sev), replace = T)#take bootstrap sample indices
outofbag <- setdiff(1:nrow(druguse_pred_sev), bootsample)#indices of out-of-bag validation data
#create inputs for ridge and lasso
x <- model.matrix(severity~., druguse_pred_sev[bootsample,])
y <- druguse_pred_sev$severity[bootsample]
modelridge <- cv.glmnet(x,y, alpha = 0)#create ridge model
ridgepred <- predict(modelridge, model.matrix(severity~., druguse_pred_sev[outofbag,]), s = 'lambda.min')#predict on validation data encoded with the same model.matrix as the training x
modellasso <- cv.glmnet(x,y, alpha = 1)#create lasso model
lassopred <- predict(modellasso, model.matrix(severity~., druguse_pred_sev[outofbag,]), s = 'lambda.min')#predict on validation data
modelglm <- glm(severity~.^2, data = druguse_pred_sev[bootsample,], family = gaussian(link = "identity")) #create glm on the bootstrap sample so the out-of-bag rows stay unseen
pglm <- predict(modelglm, druguse_pred_sev[outofbag, ])#predict on validation data
#create input for mars
x <- druguse_pred_sev[bootsample, -15]
y <- druguse_pred_sev$severity[bootsample]
modelmars <- mars(x, y, forward.step = T, prune = T, degree = 1) #create mars model
pmars <- predict(modelmars, druguse_pred_sev[outofbag,-15])#predict on validation data
predictions <- data.frame("ridge"=ridgepred, "lasso"=lassopred, "glm"=pglm, "mars"=pmars)#create data frame of predictions
predictions$severity <- druguse_pred_sev$severity[outofbag] #add severity column
nnmodelen <- nnet(severity~., data=predictions, linout=T, maxit=1000, size=(r*5), trace=F)#create nn model using predictors
#create new samples
bootsample <- sample(nrow(druguse_pred_sev),nrow(druguse_pred_sev), replace = T)#take new bootstrap sample indices
outofbag <- setdiff(1:nrow(druguse_pred_sev), bootsample)#indices of new out-of-bag validation data
ridgepred <- predict(modelridge, model.matrix(severity~., druguse_pred_sev[outofbag,]), s = 'lambda.min')#create predictions
lassopred <- predict(modellasso, model.matrix(severity~., druguse_pred_sev[outofbag,]), s = 'lambda.min')#create predictions
pglm <- predict(modelglm, druguse_pred_sev[outofbag, ])#predict on validation data
pmars <- predict(modelmars, druguse_pred_sev[outofbag,-15])#predict on validation data
predictions2 <- data.frame("ridge"=ridgepred, "lasso"=lassopred, "glm"=pglm, "mars"=pmars)#create data frame of predictions
predictions2$severity <- druguse_pred_sev$severity[outofbag] #add severity column
finalpred <- predict(nnmodelen, newdata=predictions2) #use nnet model to make predictions
mseennn[i,r] <- mean((finalpred - druguse_pred_sev$severity[outofbag])^2) #calculate mse
}
}
summary(mseennn)#summarise mse
## Ensemble NN nodes=5 Ensemble NN nodes=10 Ensemble NN nodes=15
## Min. :0.04419 Min. :0.03715 Min. :0.05319
## 1st Qu.:0.05180 1st Qu.:0.04598 1st Qu.:0.06037
## Median :0.05686 Median :0.05569 Median :0.06986
## Mean :0.05930 Mean :0.05451 Mean :0.08604
## 3rd Qu.:0.06745 3rd Qu.:0.06069 3rd Qu.:0.09617
## Max. :0.07809 Max. :0.07316 Max. :0.19879
## Ensemble NN nodes=20
## Min. :0.04467
## 1st Qu.:0.05533
## Median :0.07329
## Mean :0.18525
## 3rd Qu.:0.11521
## Max. :1.11177
The performance is best with 10 nodes in the hidden layer and is substantially better than any previous technique.
We can compare the relative performance of the models by using a boxplot to show their MSE. We have to remove some extreme outliers from the neural nets and only plot pure ridge and lasso.
mse <- cbind(mselasso[,c(1,21)],mseglm,msemars,msenn,mseen,mseennn)#combine mse matrices
mse_melted <- melt(mse)#melt to reshape and turn to data frame
mse_melted_outlier <- mse_melted[mse_melted$value<1,] #remove extreme outliers from neural nets
ggplot(mse_melted_outlier, aes(x=Var2, y=value, fill=Var2)) + geom_boxplot(colour=rainbow(13)) + coord_flip() + theme(legend.position="none") + xlab("Regression technique") + ylab("MSE") + ggtitle("Comparison of regression techniques for severity") #create boxplot of the MSEs of all techniques
We see the ensemble techniques offer the lowest MSE, as they did in the classification problem, with the neural network ensemble achieving much lower MSEs than any other method. The 10-node neural net ensemble performs best, though the other sizes perform similarly. The neural nets perform badly as standalone techniques but, surprisingly, best when used to combine the outputs of the other regressors.
In question 1 we showed that the p-values of country, extraversion and alcohol are not significant in predicting use level or severity, suggesting they aren't relevant in predicting substance abuse.
druguse %>% group_by(UseLevel, alcohol) %>% tally()#group by uselevel and alcohol in the dataframe and tally cases
## # A tibble: 14 x 3
## # Groups: UseLevel [?]
## UseLevel alcohol n
## <fct> <dbl> <int>
## 1 low 0 24
## 2 low 1 18
## 3 low 2 22
## 4 low 3 73
## 5 low 4 116
## 6 low 5 365
## 7 low 6 231
## 8 high 0 10
## 9 high 1 16
## 10 high 2 46
## 11 high 3 125
## 12 high 4 171
## 13 high 5 394
## 14 high 6 274
We can see that alcohol use is similarly distributed for both high and low use levels, suggesting that regardless of drug use, alcohol use follows much the same distribution.
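Because the two groups differ in size, the "similarly distributed" claim is easier to judge from within-group proportions than raw counts. A sketch using the counts from the tally above:

```r
# Counts from the UseLevel x alcohol tally above
tab <- rbind(
  low  = c(24, 18, 22, 73, 116, 365, 231),
  high = c(10, 16, 46, 125, 171, 394, 274)
)
colnames(tab) <- 0:6  # alcohol score

# Normalise each row to proportions so the two use levels are comparable
round(prop.table(tab, margin = 1), 2)
```

Both rows peak at an alcohol score of 5 and put most of their mass on scores 4-6, matching the visual impression from the histograms below.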
par(mfrow=c(1,2))#create space in plotting device to have histograms next to each other
hist(druguse$alcohol[druguse$UseLevel=='high'],main = "Alcohol given high use level", xlab ="Alcohol", breaks=6)#plot histogram
hist(druguse$alcohol[druguse$UseLevel=='low'],main = "Alcohol given low use level", xlab ="Alcohol", breaks=6)#plot histogram
Plotting the histograms we can see they are very similarly distributed in both cases, suggesting alcohol is not a good predictor. This is why it has a correlation near zero and a high p-value. The same is true of country and extraversion. In the case of country, however, it may be because most of the data comes from the UK or US and there is not much data for other countries.
druguse %>% group_by(UseLevel, country) %>% tally()#group by uselevel and country in the dataframe and tally cases
## # A tibble: 14 x 3
## # Groups: UseLevel [?]
## UseLevel country n
## <fct> <fct> <int>
## 1 low Australia 8
## 2 low Canada 30
## 3 low NewZealand 1
## 4 low Other 27
## 5 low RepublicofIreland 5
## 6 low UK 716
## 7 low USA 62
## 8 high Australia 46
## 9 high Canada 57
## 10 high NewZealand 4
## 11 high Other 91
## 12 high RepublicofIreland 15
## 13 high UK 328
## 14 high USA 495
par(mfrow=c(1,2))#create space in plotting device to have histograms next to each other
hist(druguse$extraversion[druguse$UseLevel=='high'],main = "Extraversion given high use level", xlab ="Extraversion", breaks=6)#plot histogram
hist(druguse$extraversion[druguse$UseLevel=='low'],main = "Extraversion given low use level", xlab ="Extraversion", breaks=6)#plot histogram
Severity and use level are highly correlated, so the same predictors are not significant for severity either, for the same reasons. However, in the case of any drug use the insignificant predictors are different.
pvalues_cor_any <- melted_cor_pvalues_ord[melted_cor_pvalues_ord$Var1=="any",]#find pvalues relating to any
head(pvalues_cor_any)#show head
##     Var1         Var2        value
## 130 any country 5.098144e-01
## 460 any chocolate 1.606900e-01
## 229 any extraversion 3.176931e-02
## 427 any caffeine 1.643696e-02
## 526 any alcohol 5.556716e-05
## 196 any neuroticism 1.407483e-06
We can see country and chocolate are not significant at the 5% significance level for any.
druguse %>% group_by(any, chocolate) %>% tally()#group by any and chocolate and tally
## # A tibble: 12 x 3
## # Groups: any [?]
## any chocolate n
## <fct> <dbl> <int>
## 1 FALSE 0 5
## 2 FALSE 3 5
## 3 FALSE 4 21
## 4 FALSE 5 67
## 5 FALSE 6 97
## 6 TRUE 0 27
## 7 TRUE 1 3
## 8 TRUE 2 10
## 9 TRUE 3 49
## 10 TRUE 4 275
## 11 TRUE 5 616
## 12 TRUE 6 710
druguse %>% group_by(any, country) %>% tally()#group by any and country and tally
## # A tibble: 12 x 3
## # Groups: any [?]
## any country n
## <fct> <fct> <int>
## 1 FALSE Australia 1
## 2 FALSE Canada 3
## 3 FALSE Other 5
## 4 FALSE UK 181
## 5 FALSE USA 5
## 6 TRUE Australia 53
## 7 TRUE Canada 84
## 8 TRUE NewZealand 5
## 9 TRUE Other 113
## 10 TRUE RepublicofIreland 20
## 11 TRUE UK 863
## 12 TRUE USA 552
From this we can see that, in the case of chocolate, the data is similarly distributed for true and false, as shown in the histograms drawn below. In the case of country, it is because there aren't many false cases for any country, as shown above. This is why both have a correlation of around 0.
par(mfrow=c(1,2))#create space in plotting device to have histograms next to each other
hist(druguse$chocolate[druguse$any==T],main = "Chocolate given they use drugs", xlab ="Chocolate", breaks=6)#plot histogram
hist(druguse$chocolate[druguse$any==F],main = "Chocolate given they don't use drugs", xlab ="Chocolate", breaks=6)#plot histogram
In the summary of the logistic regression model of question 2, these predictors were also not significant, although country being UK was significant where other countries were not; the age groups other than 45-54 and 55-64 weren't significant, and neither were ethnicity, neuroticism, agreeableness and caffeine. That summary suggests gender being male, country being UK, openness to experience, education, conscientiousness, sensation and nicotine are significant. This makes sense: predictors like alcohol are likely not significant because the data set is predominantly from the UK and US, where drinking alcohol is very common regardless of a person's traits, making it a poor predictor.